This guided practical demonstrates how the tidyverse lets you compute summary statistics and visualize datasets efficiently. The dataset is already stored in a tidy tibble; cleaning steps will come in future practicals.

Questions of this kind are optional.

datasauRus package

# install.packages("datasauRus")
library(datasauRus)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Explore the dataset

Since we are dealing with a tibble, we can just type its name:

datasaurus_dozen
## # A tibble: 1,846 x 3
##    dataset     x     y
##    <chr>   <dbl> <dbl>
##  1 dino     55.4  97.2
##  2 dino     51.5  96.0
##  3 dino     46.2  94.5
##  4 dino     42.8  91.4
##  5 dino     40.8  88.3
##  6 dino     38.7  84.9
##  7 dino     35.6  79.9
##  8 dino     33.1  77.6
##  9 dino     29.0  74.5
## 10 dino     26.2  71.4
## # … with 1,836 more rows

Only the first 10 rows are displayed.

What are the dimensions of this dataset? How many rows and columns?
  • base version, using either dim(), or ncol() and nrow()
dim(datasaurus_dozen)
## [1] 1846    3
ncol(datasaurus_dozen)
## [1] 3
nrow(datasaurus_dozen)
## [1] 1846
  • tidyverse version: printing a tibble reports its dimensions in the header
tibble(datasaurus_dozen)
## # A tibble: 1,846 x 3
##    dataset     x     y
##    <chr>   <dbl> <dbl>
##  1 dino     55.4  97.2
##  2 dino     51.5  96.0
##  3 dino     46.2  94.5
##  4 dino     42.8  91.4
##  5 dino     40.8  88.3
##  6 dino     38.7  84.9
##  7 dino     35.6  79.9
##  8 dino     33.1  77.6
##  9 dino     29.0  74.5
## 10 dino     26.2  71.4
## # … with 1,836 more rows
Assign datasaurus_dozen to the name ds_dozen. This populates the Global Environment.
ds_dozen <- datasaurus_dozen
In RStudio, these dimensions are now also reported within the interface. Where?

In the Environment pane, under Global Environment.

How many datasets are present?

You want to count the number of unique elements in the dataset column. The function length() returns the length of a vector, here the vector of unique elements returned by unique().

unique(ds_dozen$dataset) %>% length()
## [1] 13
# n_distinct counts the unique elements in a given vector.
# we use summarise to return only the desired column named n here.
summarise(ds_dozen, n = n_distinct(dataset))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    13

The dplyr function count() combines a group_by() on the specified column with summarise(n = n()), which returns the number of observations per group.

count(ds_dozen, dataset)
## # A tibble: 13 x 2
##    dataset        n
##    <chr>      <int>
##  1 away         142
##  2 bullseye     142
##  3 circle       142
##  4 dino         142
##  5 dots         142
##  6 h_lines      142
##  7 high_lines   142
##  8 slant_down   142
##  9 slant_up     142
## 10 star         142
## 11 v_lines      142
## 12 wide_lines   142
## 13 x_shape      142
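The equivalence described above can be checked on a small example. This is only a minimal sketch: the tibble `toy` and its column `group` are made up for illustration and are not part of the practical.

```r
library(dplyr)

# hypothetical toy data, not part of datasauRus
toy <- tibble(group = c("a", "b", "a", "c", "a", "b"))

# short form
short_form <- count(toy, group)

# long form: group the rows, then count them per group with n()
long_form <- toy %>%
  group_by(group) %>%
  summarise(n = n())

# compare the two results as plain data frames
all.equal(as.data.frame(short_form), as.data.frame(long_form))
```

Both forms should return one row per group with its number of observations; count() is simply the more compact way to write it.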

Check summary statistics per dataset

Compute the mean of the x and y columns. For this, you need to group_by() the appropriate column and then summarise().

In summarise() you can define as many new columns as you wish. There is no need to call it once per variable.
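As a minimal sketch of this group_by() then summarise() pattern, the example below uses a made-up tibble named `toy` with the same column names as ds_dozen. The data and the output column names `mean_x` and `mean_y` are assumptions chosen for illustration.

```r
library(dplyr)

# hypothetical toy data standing in for ds_dozen (same column names)
toy <- tibble(
  dataset = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30),
  y = c(2, 4, 6, 5, 5, 5)
)

toy_means <- toy %>%
  group_by(dataset) %>%
  # two new columns defined in a single summarise() call
  summarise(mean_x = mean(x), mean_y = mean(y))

toy_means
```

Replacing `toy` with ds_dozen should give the per-dataset means asked for in this exercise.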

Compute both mean and standard deviation (sd) in one go using across()
ds_dozen %>%
  group_by(dataset) %>%
  # across() takes first the columns to operate on, then the function(s) to apply to them
  # there are 2 ways to select the columns here: by type or by name
  # summarise(across(where(is.double), list(mean = mean, sd = sd)))
  summarise(across(c(x, y), list(mean = mean, sd = sd)))
## # A tibble: 13 x 5
##    dataset    x_mean  x_sd y_mean  y_sd
##    <chr>       <dbl> <dbl>  <dbl> <dbl>
##  1 away         54.3  16.8   47.8  26.9
##  2 bullseye     54.3  16.8   47.8  26.9
##  3 circle       54.3  16.8   47.8  26.9
##  4 dino         54.3  16.8   47.8  26.9
##  5 dots         54.3  16.8   47.8  26.9
##  6 h_lines      54.3  16.8   47.8  26.9
##  7 high_lines   54.3  16.8   47.8  26.9
##  8 slant_down   54.3  16.8   47.8  26.9
##  9 slant_up     54.3  16.8   47.8  26.9
## 10 star         54.3  16.8   47.8  26.9
## 11 v_lines      54.3  16.8   47.8  26.9
## 12 wide_lines   54.3  16.8   47.8  26.9
## 13 x_shape      54.3  16.8   47.8  26.9
What can you conclude?

Based on these summary statistics, the datasets all look alike: at the displayed precision, the means and standard deviations are the same in every dataset.

Plot the datasauRus

Plot ds_dozen with ggplot() such that the aesthetics are aes(x = x, y = y)

and the geometry is geom_point().

The ggplot() and geom_point() functions must be linked with a + sign.

ds_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point()

Reuse the above command, now colouring the points by the dataset column
ds_dozen %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()

Too many datasets are displayed.

How can we plot only one at a time?

You can filter for one dataset upstream of plotting

ds_dozen %>%
  filter(dataset == 'away') %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()

Adjust the filtering step to plot two datasets

R provides the infix operator %in% to test whether each element of the left operand matches any element of the right operand (usually a vector).
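To see what %in% returns before using it inside filter(), it can be tried on a plain character vector. This is a small sketch in base R; the vector `datasets` is made up for illustration.

```r
# hypothetical vector of dataset names
datasets <- c("away", "dino", "bullseye", "star")

# one TRUE/FALSE per element of the left operand
keep <- datasets %in% c("away", "bullseye")
keep

# the logical vector can then be used to subset
datasets[keep]
```

filter() works the same way: it keeps the rows for which the test evaluates to TRUE.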

ds_dozen %>%
  filter(dataset %in% c('away', 'bullseye')) %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()

Expand now by getting one dataset per facet

Faceting splits the plot into separate panels, one per value of a variable, so that each dataset gets its own panel.

ds_dozen %>%
  filter(dataset %in% c('away', 'bullseye')) %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset)

Remove the filtering step to facet all datasets
ds_dozen %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset)

Tweak the theme: use theme_void() and remove the legend
ds_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ dataset) +
  theme_void()

Are the datasets actually that similar?

No, summary statistics alone can be misleading.
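The same effect can be reproduced on a much smaller scale in base R. The sketch below is not how the datasauRus datasets were built; the helper `rescale()`, the two made-up vectors and the target values 50 and 15 are arbitrary assumptions chosen for illustration.

```r
# rescale a vector so that it has exactly the requested mean and sd
rescale <- function(v, target_mean, target_sd) {
  (v - mean(v)) / sd(v) * target_sd + target_mean
}

# two made-up shapes: evenly spread values versus two tight clusters
even     <- rescale(1:10, 50, 15)
clusters <- rescale(c(1, 1, 1, 1, 1, 9, 9, 9, 9, 9), 50, 15)

# same mean and sd for both vectors
c(mean(even), sd(even))
c(mean(clusters), sd(clusters))

# yet the values are distributed very differently
length(unique(round(even, 2)))
length(unique(round(clusters, 2)))
```

Both vectors share their mean and standard deviation, but one spreads over ten distinct values while the other takes only two. Only a plot, or a look at the raw values, reveals the difference.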

The R package gifski, if installed on your machine, makes GIF creation faster. gifski is written in Rust, and building it from source requires cargo, Rust's build tool. See this article to get it installed on your machine. Install Rust first, then the R package gifski. Please note that the animate() step still takes about 3-5 minutes, depending on your machine.

Install gganimate; its dependencies will be installed automatically.
# install.packages("gganimate")
# Rust is not an R package: install it outside R (see the article mentioned above) rather than with install.packages()
# install.packages("gifski")
Pass the dataset variable to the transition_states() layer
library(gganimate)

ds_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  # transition will be made using the dataset column
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # for a firework effect!
  shadow_wake(wake_length = 0.05) +
  labs(title = "dataset: {closest_state}") +
  theme_void(14) +
  theme(legend.position = "none") -> ds_anim
# more frames to slow down the animation
ds_gif <- animate(ds_anim, nframes = 500, fps = 10, renderer = gifski_renderer())
ds_gif

# save the rendered GIF; this assumes the plots/ directory already exists
anim_save("plots/ds.gif", animation = ds_gif)
Visualize the tiny differences in means for both coordinates
  • You need to zoom in tremendously to see the differences. Accumulate all states to better see the motion.
ds_dozen %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd))) %>% 
  ggplot(aes(x = x_mean, y = y_mean, colour = dataset)) +
  geom_point(size = 25, alpha = 0.6) +
  # zoom in like crazy
  coord_cartesian(xlim = c(54.25, 54.3), ylim = c(47.75, 47.9)) +
  # animate
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # do not remove previous states to pile up dots
  shadow_mark() +
  labs(title = "dataset: {closest_state}") +
  theme_minimal(14) +
  theme(legend.position = "none") -> ds_mean_anim
ds_mean_gif <- animate(ds_mean_anim, nframes = 100, fps = 10, renderer = gifski_renderer())
ds_mean_gif

anim_save("plots/ds_mean.gif")

Conclusion

never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

from this post